Post-Training LLMs for Human Alignment

A practical comparison of SFT, RLHF, DPO, ORPO, KTO, and GRPO for aligning pretrained language models with human preferences

Published: December 10, 2024

Keywords: alignment, RLHF, DPO, ORPO, KTO, GRPO, SFT, PPO, preference optimization, human feedback, reward model, TRL, fine-tuning, small models, transformers

Introduction

Pretrained language models learn broad knowledge from massive text corpora, but they don’t inherently follow instructions or behave safely. Post-training alignment bridges this gap — teaching models to produce helpful, harmless, and honest responses that match human expectations.

The alignment pipeline has evolved rapidly. Early methods like RLHF required training a separate reward model and running complex reinforcement learning. Newer approaches like DPO, ORPO, and GRPO simplify this process significantly, making alignment accessible even on consumer hardware with small models.

This article compares six key alignment methods: SFT, RLHF (PPO), DPO, KTO, ORPO, and GRPO. All code examples use small models (0.5B–1B parameters) with the TRL library.

For fine-tuning fundamentals, see Fine-tuning an LLM with Unsloth and Serving with Ollama. For model compression after alignment, see Quantization Methods for LLMs. For decoding strategies during inference, see Decoding Methods for Text Generation with LLMs.

The Alignment Pipeline Overview

Before diving into individual methods, here is how post-training fits into the LLM lifecycle:

graph LR
    A["Pretraining<br/>(next-token prediction<br/>on large corpus)"] --> B["SFT<br/>(instruction<br/>fine-tuning)"]
    B --> C["Preference Alignment<br/>(RLHF / DPO / ORPO<br/>/ KTO / GRPO)"]
    C --> D["Deployment<br/>(quantization,<br/>serving)"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333

Most alignment methods require two stages: first SFT to teach the model to follow instructions, then preference optimization to refine behavior. ORPO is unique in merging both stages into one.

1. Supervised Fine-Tuning (SFT)

SFT is the foundational first step. The model is trained on (instruction, response) pairs using standard cross-entropy loss, learning to follow instructions and produce structured outputs.

graph TD
    A["Pretrained Base Model"] --> B["Instruction Dataset<br/>(prompt → response pairs)"]
    B --> C["Cross-Entropy Loss<br/>on target tokens"]
    C --> D["SFT Model<br/>(follows instructions)"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333

How It Works

  • The model receives a prompt (instruction) and is trained to predict the expected response token by token.
  • Loss is computed only on the response tokens, not the prompt tokens.
  • Common datasets: Alpaca, OpenAssistant, UltraChat.
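The response-only loss masking can be sketched with the -100 label convention that Hugging Face trainers use to ignore positions in the cross-entropy. A minimal illustration with made-up token ids (the helper name is ours, not a library function):

```python
def build_labels(prompt_ids, response_ids):
    """Mask prompt tokens with -100 so cross-entropy is computed
    only on the response tokens (the Hugging Face convention)."""
    return [-100] * len(prompt_ids) + list(response_ids)

prompt_ids = [101, 2054, 2003]    # hypothetical token ids for the prompt
response_ids = [7592, 2088, 102]  # hypothetical token ids for the response
labels = build_labels(prompt_ids, response_ids)
# Prompt positions are ignored by the loss; response positions are supervised.
```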

Code Example with TRL

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    max_seq_length=1024,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Limitations

SFT teaches the model what to say but not how to discriminate between good and bad outputs. It increases the probability of both preferred and undesired response patterns. This is why a preference alignment stage is needed.

2. RLHF with PPO (Reinforcement Learning from Human Feedback)

RLHF is the classic alignment method, famously used to build InstructGPT and ChatGPT. It involves training a separate reward model on human preference data, then using Proximal Policy Optimization (PPO) to maximize the reward while staying close to the original model.

graph TD
    subgraph Stage1["Stage 1: Reward Model Training"]
        direction TB
        A1["Human Annotators<br/>rank responses"] --> A2["Preference Dataset<br/>(prompt, chosen, rejected)"]
        A2 --> A3["Train Reward Model<br/>(Bradley-Terry)"]
    end

    subgraph Stage2["Stage 2: PPO Fine-Tuning"]
        direction TB
        B1["SFT Model generates<br/>responses to prompts"] --> B2["Reward Model<br/>scores responses"]
        B2 --> B3["PPO updates policy<br/>maximize reward - β·KL"]
    end

    Stage1 --> Stage2

    style A1 fill:#4a90d9,color:#fff,stroke:#333
    style A2 fill:#f5a623,color:#fff,stroke:#333
    style A3 fill:#e74c3c,color:#fff,stroke:#333
    style B1 fill:#4a90d9,color:#fff,stroke:#333
    style B2 fill:#f5a623,color:#fff,stroke:#333
    style B3 fill:#27ae60,color:#fff,stroke:#333

How It Works

  1. Reward Model: Trained on pairs of (chosen, rejected) responses. It learns to assign higher scores to human-preferred outputs using the Bradley-Terry ranking model.
  2. PPO Optimization: The policy (SFT model) generates responses, the reward model scores them, and PPO updates the policy to maximize reward while a KL divergence penalty prevents the model from drifting too far from the reference (SFT) model.
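As a rough scalar sketch (not the TRL implementation), the Bradley-Terry objective reduces to a logistic loss on the score difference between chosen and rejected responses:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response is scored higher, large otherwise."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

loss_correct = bradley_terry_loss(2.0, 0.0)  # pair ranked correctly
loss_wrong = bradley_terry_loss(0.0, 2.0)    # pair ranked incorrectly
```

Training the reward model drives it toward assigning larger margins to human-preferred responses.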

The objective is:

\max_\pi \mathbb{E}_{x \sim D, y \sim \pi}[R(x, y)] - \beta \cdot D_{KL}[\pi \| \pi_{\text{ref}}]
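In practice the KL term is folded into the reward the policy sees. A minimal numeric sketch, assuming sequence-level log-probabilities and a hypothetical helper name:

```python
def shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """R(x, y) - beta * KL, using log pi(y|x) - log pi_ref(y|x)
    as a per-sequence KL estimate."""
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

# The further the policy drifts above the reference on its own samples,
# the more the effective reward is reduced.
r = shaped_reward(reward=1.0, logp_policy=-10.0, logp_ref=-12.0)
```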

Key Components

Component       | Role
Policy model    | The LLM being optimized
Reference model | Copy of the SFT model (frozen); prevents reward hacking
Reward model    | Scores generated outputs
Value model     | Estimates expected future rewards for PPO

Limitations

  • Requires 3–4 models in memory simultaneously (policy, reference, reward, value)
  • Training is unstable — sensitive to hyperparameters
  • Reward model can be gamed (reward hacking)
  • Complex engineering pipeline

3. DPO (Direct Preference Optimization)

DPO eliminates the need for a separate reward model by directly optimizing the policy on preference data. The key insight: the optimal RL policy can be expressed in closed form given the reward function, so we can reparametrize the reward model loss as a policy loss.

graph TD
    A["SFT Model<br/>(policy + reference)"] --> B["Preference Dataset<br/>(prompt, chosen, rejected)"]
    B --> C["DPO Loss<br/>binary cross-entropy<br/>on log-probability ratios"]
    C --> D["Aligned Model<br/>(no reward model needed)"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333

How It Works

DPO defines the loss directly on preference pairs:

\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-)} \left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y^+ | x)}{\pi_{\text{ref}}(y^+ | x)} - \log \frac{\pi_\theta(y^- | x)}{\pi_{\text{ref}}(y^- | x)}\right)\right)\right]

In practice, DPO increases the relative probability of the chosen response and decreases that of the rejected one, all while staying close to the reference model. The hyperparameter \beta controls the strength of the preference signal (typical values: 0.1–0.5).
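The loss can be sketched numerically from sequence log-probabilities. A simplified scalar version, not the batched TRL implementation:

```python
import math

def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((lp_chosen - ref_chosen) - (lp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The policy already prefers the chosen response more than the reference does,
# so the margin is positive and the loss is below the 0.693 chance level.
loss = dpo_loss(lp_chosen=-5.0, lp_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0)
```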

Code Example with TRL

from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
from peft import LoraConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-6,
    beta=0.1,
    max_length=1024,
)

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()

Advantages over RLHF

  • No reward model needed — only 2 models in memory (policy + reference)
  • Stable training with simple classification loss
  • Much simpler to implement
  • Comparable or better results on benchmarks

Dataset Format

DPO expects preference data with three fields:

# Standard format
{"prompt": "What is AI?",
 "chosen": "AI is a branch of computer science...",
 "rejected": "AI is when computers become sentient..."}

# Conversational format
{"prompt": [{"role": "user", "content": "What is AI?"}],
 "chosen": [{"role": "assistant", "content": "AI is a branch of..."}],
 "rejected": [{"role": "assistant", "content": "AI is when..."}]}

4. KTO (Kahneman-Tversky Optimization)

KTO removes the requirement of paired preference data. Instead of needing (chosen, rejected) pairs for the same prompt, KTO works with individual examples labeled as simply “good” or “bad” — like a thumbs up/thumbs down signal.

graph TD
    A["SFT Model"] --> B["Unpaired Feedback<br/>👍 good examples<br/>👎 bad examples"]
    B --> C["KTO Loss<br/>(Kahneman-Tversky<br/>value function)"]
    C --> D["Aligned Model"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333

How It Works

KTO is based on prospect theory from behavioral economics (Kahneman & Tversky). It models the human tendency to weigh losses more heavily than equivalent gains. The loss function treats “good” and “bad” examples independently:

  • For good examples: maximize the utility of the model’s improvement over the reference
  • For bad examples: penalize using a loss-averse weighting (losses hurt more than gains feel good)
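These two cases can be sketched as a simplified scalar loss. This drops the per-class loss-aversion weights λ_D, λ_U and treats the reference-KL baseline z_ref as a constant (the real implementation estimates it from a batch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio, desirable, beta=0.1, z_ref=0.0):
    """Simplified per-example KTO loss.
    log_ratio = log pi(y|x) - log pi_ref(y|x)."""
    if desirable:
        return 1.0 - sigmoid(beta * (log_ratio - z_ref))  # reward improvement
    return 1.0 - sigmoid(beta * (z_ref - log_ratio))      # penalize it on bad examples

# The same policy improvement is good on a thumbs-up example
# and bad on a thumbs-down example:
loss_good = kto_loss(log_ratio=2.0, desirable=True)
loss_bad = kto_loss(log_ratio=2.0, desirable=False)
```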

When to Use KTO

Scenario                              | Recommended?
You have paired preference data       | DPO is generally better
You only have thumbs up/down labels   | KTO is ideal
Production chatbot with user feedback | KTO is ideal
Creating preference data is expensive | KTO is ideal

KTO is particularly practical when collecting production feedback — user thumbs up/down ratings are much easier to collect than pairwise comparisons.

5. ORPO (Odds Ratio Preference Optimization)

ORPO is unique: it combines SFT and preference alignment into a single training step. Instead of first doing SFT then DPO, ORPO adds an odds ratio penalty to the standard NLL (negative log-likelihood) loss, achieving both instruction-following and preference alignment simultaneously.

graph TD
    A["Pretrained Base Model"] --> B["Preference Dataset<br/>(prompt, chosen, rejected)"]
    B --> C["ORPO Loss<br/>= NLL + λ·OR Loss"]

    subgraph LossComponents["Loss Components"]
        direction LR
        D["NLL Loss<br/>(SFT signal on<br/>chosen response)"]
        E["Odds Ratio Loss<br/>(penalize rejected,<br/>reward chosen)"]
    end

    C --> LossComponents
    LossComponents --> F["Aligned Model<br/>(single stage!)"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#4a90d9,color:#fff,stroke:#333
    style E fill:#f5a623,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333

How It Works

The ORPO objective is:

\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x, y^+, y^-)} \left[\mathcal{L}_{\text{SFT}}(x, y^+) + \lambda \cdot \mathcal{L}_{\text{OR}}(x, y^+, y^-)\right]

Where \mathcal{L}_{\text{OR}} is the odds ratio loss that contrasts the likelihood of chosen vs. rejected responses. The NLL component handles instruction following (like SFT), while the OR component handles preference alignment.
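A scalar sketch of the odds-ratio term, assuming p denotes the (length-normalized) likelihood the policy assigns to a response:

```python
import math

def log_odds(p):
    """Log odds of a response with likelihood p: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def or_loss(p_chosen, p_rejected):
    """Odds-ratio loss: -log sigmoid(log odds(chosen) - log odds(rejected))."""
    log_or = log_odds(p_chosen) - log_odds(p_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

# Full ORPO objective on one pair: NLL on the chosen response plus
# the weighted odds-ratio penalty (lambda = 0.1 here).
total = -math.log(0.6) + 0.1 * or_loss(p_chosen=0.6, p_rejected=0.2)
```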

Key Advantages

  • No reference model required — saves 50% memory compared to DPO
  • Single stage — no separate SFT step
  • Computationally efficient — fewer total training steps
  • Tested from 125M to 7B parameters

Code Example with TRL

from trl.experimental.orpo import ORPOTrainer, ORPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = ORPOConfig(
    output_dir="Qwen2-0.5B-ORPO",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=8e-6,
    beta=0.1,  # λ: weight of the OR loss
    max_length=1024,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()

6. GRPO (Group Relative Policy Optimization)

GRPO was introduced by DeepSeek for enhancing mathematical reasoning. Unlike DPO, which uses offline preference data, GRPO is an online RL method that generates multiple completions per prompt, scores them with a reward function, and uses the relative ranking within each group to compute advantages — all without a separate value model.

graph TD
    A["Policy Model"] --> B["Generate G completions<br/>per prompt"]
    B --> C["Score with<br/>Reward Function"]
    C --> D["Compute Group-Relative<br/>Advantages<br/>A = (r - mean) / std"]
    D --> E["PPO-style Update<br/>with clipped objective"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333

How It Works

  1. For each prompt, generate G completions (e.g., G=8)
  2. Score each completion with a reward function (can be a model or a rule-based function)
  3. Compute group-relative advantage: normalize rewards within the group to get relative quality
  4. Update the policy using a clipped surrogate objective (like PPO, but without a value model)

The advantage for completion i is:

\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}
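The group normalization is straightforward to sketch. This uses the population standard deviation within one group; the TRL implementation batches this across many prompts:

```python
def group_relative_advantages(rewards):
    """A_i = (r_i - mean(r)) / std(r), computed within one group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    if std == 0.0:  # all completions scored the same: no learning signal
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

# G = 4 completions scored by a binary rule-based reward:
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```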

Key Innovation

GRPO replaces the value model used in PPO with group-relative normalization, making it:

  • Memory-efficient: No value model needed
  • Self-improving: Uses model’s own generations for training (online RL)
  • Flexible: Works with any reward function — including rule-based rewards (no neural reward model required)

Code Example with TRL

from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2.5-0.5B-GRPO",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-6,
    num_generations=8,
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Custom Reward Functions

One of GRPO’s strengths is its support for rule-based reward functions — no neural reward model needed:

import re

def format_reward(completions, **kwargs):
    """Reward for structured <think>...</think><answer>...</answer> format."""
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    # re.DOTALL lets ".*?" span newlines inside multi-line completions
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]

def length_reward(completions, **kwargs):
    """Reward longer, more detailed responses."""
    return [min(len(c) / 500, 1.0) for c in completions]

# Combine multiple reward functions; their relative weights are set
# in the config via GRPOConfig(reward_weights=[1.0, 0.5], ...)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[format_reward, length_reward],
    ...
)

Method Comparison

graph TD
    A{{"Do you have<br/>preference data?"}}
    A -->|"No, only instructions"| B["SFT"]
    A -->|"Yes"| C{{"Paired or<br/>unpaired?"}}
    C -->|"Unpaired<br/>(thumbs up/down)"| D["KTO"]
    C -->|"Paired<br/>(chosen/rejected)"| E{{"Want single-stage<br/>training?"}}
    E -->|"Yes"| F["ORPO"]
    E -->|"No"| G{{"Online or<br/>offline RL?"}}
    G -->|"Offline<br/>(fixed dataset)"| H["DPO"]
    G -->|"Online<br/>(model generates)"| I{{"Need rule-based<br/>rewards?"}}
    I -->|"Yes"| J["GRPO"]
    I -->|"No, have<br/>reward model"| K["RLHF (PPO)"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#e74c3c,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333
    style G fill:#e74c3c,color:#fff,stroke:#333
    style H fill:#27ae60,color:#fff,stroke:#333
    style I fill:#e74c3c,color:#fff,stroke:#333
    style J fill:#27ae60,color:#fff,stroke:#333
    style K fill:#27ae60,color:#fff,stroke:#333

Summary Table

Method     | Type       | Models in Memory | Needs Reward Model | Needs Reference Model | Data Requirement                | Key Library
SFT        | Supervised | 1                | No                 | No                    | (instruction, response)         | TRL SFTTrainer
RLHF (PPO) | Online RL  | 3–4              | Yes                | Yes                   | Preference pairs + reward model | TRL PPOTrainer
DPO        | Offline    | 2                | No                 | Yes                   | Preference pairs                | TRL DPOTrainer
KTO        | Offline    | 2                | No                 | Yes                   | Unpaired good/bad labels        | TRL KTOTrainer
ORPO       | Offline    | 1                | No                 | No                    | Preference pairs                | TRL ORPOTrainer
GRPO       | Online RL  | 1–2              | Optional           | Optional              | Prompts + reward function       | TRL GRPOTrainer

Hyperparameter Sensitivity

The \beta parameter is critical across methods. Empirical studies show:

Method | Recommended β Range | Notes
DPO    | 0.01 – 0.5          | Lower β often works best; 0.1 is a common default
KTO    | 0.01 – 0.3          | Similar trends to DPO
ORPO   | 0.1 (λ)             | Controls the weight of the odds ratio loss
GRPO   | 0.0 – 0.001         | Recent work suggests β = 0 (no KL penalty) works well

Practical Recommendations

Resource-Constrained Settings

For training on a single consumer GPU (16–24 GB VRAM) with small models:

  1. Start with SFT using QLoRA to teach instruction following
  2. Apply DPO or ORPO with LoRA adapters for preference alignment
  3. Use 4-bit quantization (bitsandbytes) to fit both policy and reference models
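The third step can be sketched with a bitsandbytes 4-bit configuration. NF4 with double quantization is the common QLoRA setup; adjust the compute dtype to what your GPU supports:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", quantization_config=bnb_config
)
```

Pass the quantized model to any TRL trainer together with a LoRA `peft_config`.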

When to Use Each Method

  • SFT only: When you have high-quality instruction data and just need a helpful assistant
  • DPO: Best general-purpose alignment method — simple, stable, well-tested
  • ORPO: When compute is limited and you want a single-stage pipeline
  • KTO: When you only have binary feedback (production chatbot settings)
  • GRPO: For reasoning tasks (math, code) where you can define verifiable reward functions
  • RLHF: When you have the infrastructure and need maximum control over the reward signal

Training with LoRA/QLoRA

All methods support parameter-efficient fine-tuning:

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Pass to any TRL trainer
trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    peft_config=peft_config,
    ...
)

Evolution of Alignment Methods

graph LR
    A["RLHF + PPO<br/>(2022)<br/>InstructGPT"] --> B["DPO<br/>(2023)<br/>No reward model"]
    B --> C["KTO<br/>(2024)<br/>Unpaired data"]
    B --> D["IPO<br/>(2023)<br/>Regularized DPO"]
    A --> E["ORPO<br/>(2024)<br/>Single-stage"]
    A --> F["GRPO<br/>(2024)<br/>DeepSeek-Math"]
    F --> G["DeepSeek-R1<br/>(2025)<br/>Reasoning RL"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

Conclusion

Human alignment has rapidly evolved from the complex RLHF pipeline to simpler, more efficient methods. DPO remains the most popular general-purpose method due to its stability and simplicity. ORPO offers an attractive single-stage alternative. GRPO is emerging as the method of choice for reasoning tasks, especially after its success in DeepSeek-R1.

The choice of method depends on your data, compute, and use case. For most practitioners starting out, the recommended path is:

  1. SFT with LoRA on instruction data
  2. DPO with LoRA on preference data
  3. Quantize and deploy

For serving your aligned model, see Run LLM locally with Ollama or Deploying and Serving LLM with vLLM.

References

  • Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT), 2022. arXiv:2203.02155
  • Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023. arXiv:2305.18290
  • Ethayarajh et al., KTO: Model Alignment as Prospect Theoretic Optimization, 2024. arXiv:2402.01306
  • Azar et al., A General Theoretical Paradigm to Understand Learning from Human Feedback (IPO), 2023. arXiv:2310.12036
  • Hong et al., ORPO: Monolithic Preference Optimization without Reference Model, 2024. arXiv:2403.07691
  • Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO), 2024. arXiv:2402.03300
  • DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025. arXiv:2501.12948
  • von Werra et al., TRL: Transformer Reinforcement Learning. GitHub
